(improvement) Optimize VectorType deserialization with struct.unpack and numpy (us level improvements - 2-13x speedup - Python path only!)#730
Pull request overview
This PR optimizes VectorType (de)serialization in cassandra/cqltypes.py by introducing bulk numeric (de)serialization via a cached struct.Struct, and an optional numpy-based deserialization fast path for larger vectors.
Changes:
- Cache a per-parameterized-vector struct.Struct to bulk unpack/pack common numeric vector subtypes.
- Add an optional numpy frombuffer(...).tolist() deserialization fast-path for vectors with vector_size >= 32.
- Refactor variable-size vector deserialization to a fixed-iteration loop with stricter bounds checks.
…ct.unpack
Add bulk deserialization using struct.unpack for common numeric vector types
instead of element-by-element deserialization. This provides significant
performance improvements, especially for small vectors and integer types.
Optimized types:
- FloatType ('>Nf' format)
- DoubleType ('>Nd' format)
- Int32Type ('>Ni' format)
- LongType ('>Nq' format)
- ShortType ('>Nh' format)
Performance improvements (measured with CASS_DRIVER_NO_CYTHON=1):
Small vectors (3-4 elements):
Vector<float, 3> : 0.88 μs → 0.25 μs (3.58x faster)
Vector<float, 4> : 0.78 μs → 0.28 μs (2.79x faster)
Medium vectors (128 elements):
Vector<float, 128> : 4.72 μs → 4.06 μs (1.16x faster)
Vector<double, 128> : 4.83 μs → 4.01 μs (1.20x faster)
Vector<int, 128> : 2.27 μs → 1.25 μs (1.82x faster)
Large vectors (384-1536 elements):
Vector<float, 384> : 15.38 μs → 14.67 μs (1.05x faster)
Vector<float, 768> : 32.43 μs → 30.72 μs (1.06x faster)
Vector<float, 1536> : 63.74 μs → 63.24 μs (1.01x faster)
The optimization is most effective for:
- Small vectors (3-4 elements): 2.8-3.6x speedup
- Integer vectors: 1.8x speedup
- Medium-sized float/double vectors: 1.16-1.20x speedup
For very large vectors (384+ elements), the benefit is minimal as the
deserialization time is dominated by data copying rather than function
call overhead.
Variable-size subtypes and other numeric types continue to use the
element-by-element fallback path.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
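To illustrate the bulk path described in this commit message, here is a minimal sketch, not the driver's actual code. The names `_FORMAT_CHARS`, `make_vector_codec`, `deserialize_bulk` and `deserialize_per_element` are hypothetical; only the `struct` format strings (`'>Nf'`, `'>Nd'`, and so on) come from the commit message.

```python
import struct

# Hypothetical mapping from a subtype label to a struct format character.
# The format characters themselves are the ones listed in the commit message.
_FORMAT_CHARS = {"float": "f", "double": "d", "int32": "i", "long": "q", "short": "h"}


def make_vector_codec(subtype_name, size):
    """Build a struct.Struct once for a fixed-size numeric vector."""
    # '>' selects big-endian; '>3f' means three big-endian 32-bit floats.
    return struct.Struct(">%d%s" % (size, _FORMAT_CHARS[subtype_name]))


def deserialize_bulk(codec, byts):
    # A single C-level call unpacks every element; unpack() returns a tuple.
    return list(codec.unpack(byts))


def deserialize_per_element(fmt_char, size, byts):
    # Baseline for comparison: one unpack call per element.
    elem = struct.Struct(">" + fmt_char)
    return [elem.unpack_from(byts, i * elem.size)[0] for i in range(size)]


codec = make_vector_codec("float", 3)
payload = codec.pack(1.0, 2.0, 3.5)
print(deserialize_bulk(codec, payload))
```

Both functions return the same list; the bulk version simply replaces N Python-level calls with one, which is consistent with the largest gains appearing on small vectors.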
For vectors with 32 or more elements, use numpy.frombuffer() which provides
1.3-1.5x speedup for large vectors (128+ elements) compared to struct.unpack.
The hybrid approach:
- Small vectors (< 32 elements): struct.unpack (2.8-3.6x faster than baseline)
- Large vectors (>= 32 elements): numpy.frombuffer().tolist() (1.3-1.5x faster than struct.unpack)
Threshold of 32 elements balances code complexity with performance gains.
Benchmark results:
- float[128]: 2.15 μs → 1.87 μs (1.15x faster)
- float[384]: 6.17 μs → 4.44 μs (1.39x faster)
- float[768]: 12.25 μs → 8.45 μs (1.45x faster)
- float[1536]: 24.44 μs → 15.77 μs (1.55x faster)
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
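The hybrid dispatch in this commit message could look roughly like the sketch below. It is an illustration under stated assumptions: `NUMPY_THRESHOLD` and `deserialize_floats` are hypothetical names, only the float subtype is handled, and numpy is treated as optional, falling back to `struct` when it is not installed.

```python
import struct

try:
    import numpy as np
except ImportError:  # numpy is optional; fall back to struct everywhere
    np = None

NUMPY_THRESHOLD = 32  # hypothetical constant; the value 32 is from the commit message


def deserialize_floats(byts, size):
    """Deserialize `size` big-endian 32-bit floats into a Python list."""
    if np is not None and size >= NUMPY_THRESHOLD:
        # frombuffer creates a view over the bytes without copying;
        # tolist() then converts to Python floats in one batch.
        return np.frombuffer(byts, dtype=">f4", count=size).tolist()
    # Small vectors (or no numpy): a single bulk struct.unpack call.
    return list(struct.unpack(">%df" % size, byts))


values = [float(i) for i in range(64)]
payload = struct.pack(">64f", *values)
print(deserialize_floats(payload, 64)[:4])
```

The sketch builds the format string per call for brevity; the PR instead caches the `struct.Struct` and the numpy dtype at type-creation time.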
c417e73 to 0535ecd
…ated method dispatch
Cache subtype.serial_size() and the full vector serial_size() as class
attributes (_subtype_serial_size, _serial_size) during apply_parameters().
This eliminates per-call method dispatch overhead in serialize(),
deserialize(), and serial_size() hot paths.
serial_size() call: 99ns -> 46ns (2.2x faster)
Attribute access: 54ns -> 17ns (3.2x faster)
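A minimal sketch of the caching idea in this commit, assuming a simplified class layout: `FloatSubtype` and `VectorSketch` are hypothetical stand-ins, while `apply_parameters`, `_subtype_serial_size` and `_serial_size` are the names used in the commit message.

```python
class FloatSubtype:
    """Hypothetical stand-in for a fixed-size numeric subtype."""

    @classmethod
    def serial_size(cls):
        return 4  # a 32-bit float occupies 4 bytes


class VectorSketch:
    """Hypothetical, simplified vector type used only for illustration."""

    subtype = None
    vector_size = 0
    _subtype_serial_size = None
    _serial_size = None

    @classmethod
    def apply_parameters(cls, subtype, vector_size):
        # Compute both sizes once, when the parameterized type is created.
        sub_size = subtype.serial_size()
        attrs = {
            "subtype": subtype,
            "vector_size": vector_size,
            "_subtype_serial_size": sub_size,
            # Variable-size subtypes report None, so the vector does too.
            "_serial_size": None if sub_size is None else sub_size * vector_size,
        }
        return type("VectorSketch_%d" % vector_size, (cls,), attrs)

    @classmethod
    def serial_size(cls):
        # Hot path: a plain attribute read, no call into the subtype.
        return cls._serial_size


Vec768 = VectorSketch.apply_parameters(FloatSubtype, 768)
print(Vec768.serial_size())
```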
@mykaul This is not a draft, but review was not requested. Please either change to draft, or request review.
It's an improvement, not a fix. I believe it's ready, but I don't want to disrupt the team. I'm not sure what to do (and I do it for fun anyway). If there's anything that I see as important - I'm not shy.
Summary
- Bulk (de)serialization with struct.unpack for known numeric types (float, double, int32, int64, short), caching a struct.Struct object at type-creation time
- numpy fast path (np.frombuffer().tolist()) for vectors with >= 32 elements
- Cache serial_size() results to eliminate per-call method dispatch overhead
- Remove dead KeyError catch, wrap subtype.deserialize failures with element context and proper exception chaining
Performance (pure Python, best of 5)
Deserialization: Vector<float, 4>, Vector<float, 16>, Vector<float, 128>, Vector<float, 768>, Vector<float, 1536>
Serialization: Vector<float, 4>, Vector<float, 16>, Vector<float, 128>, Vector<float, 768>, Vector<float, 1536>
serial_size() overhead: serial_size() call (768-dim)
Details
Commit 1 -- struct.unpack optimization + variable-size path fixes:
- At apply_parameters() time, cache a struct.Struct('>Nf') for the vector's subtype+dimension
- deserialize() calls list(struct.unpack(byts)) -- single C-level bulk unpack
- Serialization uses struct.pack(*v)
- Remove KeyError from except clause (uvint_unpack only raises IndexError), wrap subtype.deserialize failures in ValueError with element index and proper exception chaining (from e)
Commit 2 -- numpy for large vectors:
- Large vectors use np.frombuffer(byts, dtype='>f4', count=N).tolist()
- .tolist() batch-converts with better cache locality
- _numpy_dtype cached on the class at type-creation time (no per-call dict construction)
Commit 3 -- serial_size caching:
- Cache subtype.serial_size() result as _subtype_serial_size and the full vector serial size as _serial_size during apply_parameters()
- serial_size() returns cached value directly (no method dispatch chain)
- serialize() and deserialize() use cls._subtype_serial_size instead of calling cls.subtype.serial_size() each time
All three commits modify only cassandra/cqltypes.py. No Cython dependency.
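The variable-size path fix in Commit 1 (fixed-iteration loop, bounds checks, and errors wrapped with element context) can be sketched as follows. This is an illustration only: `deserialize_variable` is a hypothetical name, and for brevity each element is prefixed by a single length byte rather than the driver's unsigned varint (`uvint_unpack`) encoding.

```python
def deserialize_variable(byts, vector_size, deserialize_element):
    """Decode exactly `vector_size` length-prefixed elements from `byts`."""
    result = []
    idx = 0
    for i in range(vector_size):  # fixed number of iterations
        try:
            # Simplifying assumption: a one-byte length prefix, not a varint.
            length = byts[idx]
        except IndexError as e:
            raise ValueError("Missing length for vector element %d" % i) from e
        idx += 1
        end = idx + length
        if end > len(byts):
            raise ValueError(
                "Vector element %d needs %d bytes, only %d remain"
                % (i, length, len(byts) - idx)
            )
        try:
            result.append(deserialize_element(byts[idx:end]))
        except Exception as e:
            # 'from e' keeps the original exception as __cause__.
            raise ValueError("Failed to deserialize vector element %d" % i) from e
        idx = end
    return result


payload = b"\x02hi\x03abc"
print(deserialize_variable(payload, 2, lambda b: b.decode("ascii")))
```

Chaining with `from e` means the traceback shows both the element index and the underlying subtype error, which is the point of the change described above.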